Differentially expressed heterogeneous overdispersion genes testing for count data

The mRNA-seq data analysis is a powerful technology for inferring information from biological systems of interest. Specifically, the sequenced RNA fragments are aligned with genomic reference sequences, and we count the number of sequence fragments corresponding to each gene for each condition. A gene is identified as differentially expressed (DE) if the difference in its count numbers between conditions is statistically significant. Several statistical analysis methods have been developed to detect DE genes based on RNA-seq data. However, the existing methods could suffer decreasing power to identify DE genes arising from overdispersion and limited sample size, where overdispersion refers to the empirical phenomenon that the variance of read counts is larger than the mean of read counts. We propose a new differential expression analysis procedure: heterogeneous overdispersion genes testing (DEHOGT) based on heterogeneous overdispersion modeling and a post-hoc inference procedure. DEHOGT integrates sample information from all conditions and provides a more flexible and adaptive overdispersion modeling for the RNA-seq read count. DEHOGT adopts a gene-wise estimation scheme to enhance the detection power of differentially expressed genes when the number of replicates is limited as long as the number of conditions is large. DEHOGT is tested on the synthetic RNA-seq read count data and outperforms two popular existing methods, DESeq2 and EdgeR, in detecting DE genes. We apply the proposed method to a test dataset using RNAseq data from microglial cells. DEHOGT tends to detect more differently expressed genes potentially related to microglial cells under different stress hormones treatments.

Application on murine alveolar macrophages dataset We perform the proposed DEHOGT method on another benchmark RNA-seq dataset [2], and compare the differentially expressed gene detection performance with the DESeq2 and EdgeR methods.Specifically, the RNA-seq dataset is collected through the Illumina NextSeq 500 platform in a study examining transcriptional changes in alveolar macrophages after performing reperfusion in murine lung transplants.There are 43430 genes in total, each with 12 normalized read count samples: 4 samples in the control group, four samples in the group of 2 hours post-reperfusion, and four samples in the group of 24 hours post-reperfusion.We examine the DE genes between the control group and 2 hours post-reperfusion group, and between the control group and 24 hours post-reperfusion group, respectively.We then select DE genes from the above three methods, given that the false discovery rates are controlled at 0.05 and the absolute values of logfold change are larger than 1.DE genes.In addition, we notice that genes Csf3r, Gatm, Tyms Gzma, Adora2b, Fads1, Fads2, Mvk, Ube2t, and Haus7 are uniquely identified by the proposed method DEHOGT, and several studies [3][4][5][6][7][8][9] find that these genes are significantly related to the alveolar macrophages.successful cell line authentication and Mycoplasma testing (Genetica, Burlington, NC).HMC3 cells (passage eight) were seeded in T-25 flasks with 2 x 105 viable cells and incubated at 37°C and 5% CO2.After 24 hours, the growth medium in each T-25 was replaced with one of the following treatments: dexamethasone (1 or 0.01 µM), hydrocortisone (10 or 0.01 µM), vehicle (ethanol alcohol) or control (untreated media).Cells were incubated in treatment media for three days at 37°C and 5% CO2 and imaged daily using the Axio Vert.A1 inverted microscope (Zeiss Oberkochen, Germany).At three days post-exposure (D3), cells were collected from each flask individually, quantified on the Countess II cell counter (Invitrogen Waltham, MA), seeded at 2 x 105 viable cells/flask in new T-25 flasks with normal growth medium, and incubated for three additional days (i.e., washout period).The remaining D3 cell suspension for each flask was divided equally between two microcentrifuge tubes, pelleted and washed with PBS.One cell pellet per flask was placed in -80°C storage for future DNA extraction; the remaining cell pellet underwent RNA extraction using the RNeasy Mini Kit (QIAGEN, Hilden, Germany) protocol adapted for the QIAcube automated system (QIAGEN).On the final day of the washout period (D6), cells from each flask were imaged, collected in suspension and then quantified on the Countess II.Cell suspensions were split equally into two aliquots and then prepped for nucleic acid extraction as described for D3.RNA samples from D3 and D6 were DNase treated (Dnase I kit; Sigma), quantified on the Qubit (RNA BR Assay Kit; Invitrogen) and scored for RNA integrity on the TapeStation (High Sensitivity RNA ScreenTape; Agilent).Library preparation was performed following the Illumina TruSeq Stranded Total RNA Library Prep Kit protocol (Illumina, San Diego, CA) with TruSeq RNA Single Indexes (Set A and B; Illumina).Library quantity and quality were assessed using the Qubit 1X dsDNA HS Assay (Invitrogen), TapeStation High Sensitivity D1000 ScreenTape (Agilent), and using the KAPA Library Quantification Kit (Roche Basel, Switzerland) for the LightCycler 96 (Roche).RNA sequencing was conducted on the NextSeq 550 (Illumina) using the High Output Kit with 76 paired-end cycles (Illumina).

Microglia cell experiment design
RNA-seq data preprocessing The summary of processing microglia RNAseq dataset, including the workflow, FastQC setup, and genome reference, is as follows.Specifically, the raw sequencing data from the Illumina NextSeq 500 system was quality-controlled using FastQC [10], version (v) 0.12.1.The paired-end reads (150 bp read length) were trimmed removing Illumina sequencing adapters within the sequences.Additionally, low-quality reads were excluded from downstream analysis applying a Phred quality score cutoff of ¿=30.The resulting reads were aligned to the reference genome (GRCh38/HG38) using STAR v 2.7.10 [11].Gene counts were generated using the FeatureCounts function of the R-package Subreads, v 2.0.1 [12].

Gene ontology analysis on Microglia cell dataset
We perform gene ontology analysis on the genes selected from DEHOGT.In summary, our analysis shows that the selected genes are functionally relevant.From the perspective of biological processes, the selected genes are enriched in the gene ontology categories, including regulation of the immune system process, regulation of response to stress and external stimulus, and regulation of response to stimulus, which are highly relevant categories to PTSD conditions.From the perspective of cellular components, the selected genes are enriched in the categories of the ribonucleoprotein complex and the Golgi complex, which are found to be relevant in neuro-degenerative diseases formulation [13][14][15].We add the detailed gene ontology analysis output in supplementary materials.

Fig 2 .
Fig 2. The area under the ROC curve from different methods when the read counts follow the negative binomial distribution with different overdispersion levels θ NB g .

Table 1 .
5. The numbers of detected DE genes are listed in the following Table1.Compared with DESeq2 and EdgeR, the proposed DEHOGT method selects more The number of selected DE genes from murine lung transplant RNA-seq data under different treatment comparisons.